Read me for: Clark, Jesse, John A. Curiel and Tyler S. Steelman. "Minmaxing of Bayesian Improved Surname and Geography Level Ups in Predicting Race."

This replication directory contains the scripts and data necessary to compute the non-personal identifying information (pii) results of our paper. The files are divided into the following: code files, data files, and results. Please direct any questions about these files to John A. Curiel at jcuriel@mit.edu.

NOTE: The personal identifying information (pii) has been removed from these replication files. This includes the coordinate data from the Georgia voterfile, necessary for BISG using the wru package. These estimates are read in from the confidential pii related script.  


For the purposes of replication, the non-pii scripts were run on a desktop computer with the following specifications: Intel(R) Core(TM) i7-9700 CPU 6-Core/12-Thread, 12MB Cache, 3.0 GHz, 64GB DDR4 Memory XMP at 2933MHz, 2TB HDD.

Note: For the replication code to work, download all the data files from the dataverse in their original file formats (e.g., RData files for the files marked below with that extension).
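
Before running the scripts, it may help to confirm that the downloaded files are present under their original names. The sketch below is only an illustration: the folder in data_dir and the choice of which files to check are assumptions, with the file names taken from the pii-free entries listed further down in this read me.

```r
# Assumption: the dataverse files were downloaded into this folder.
data_dir <- "."

# An illustrative subset of the files this read me describes as pii-free.
expected_files <- c(
  "ga_voterfile_geocoded_postpii.rds",
  "ga_voterfile_geocoded_postpii_final1.rds",
  "ga_voterfile_geocoded2sample_set.rds",
  "ga_geocoded_anon.csv",
  "sampled_df_geocoded01222021B.csv"
)

# Check each expected file name against the folder contents.
present <- file.exists(file.path(data_dir, expected_files))
missing_files <- expected_files[!present]

if (length(missing_files) > 0) {
  message("Not found in ", data_dir, ": ",
          paste(missing_files, collapse = ", "))
}

# .rds files are read with readRDS() and .csv files with read.csv(), e.g.:
# voterfile <- readRDS(file.path(data_dir, "ga_voterfile_geocoded_postpii.rds"))
```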

Code files:

ga_voterfile_wru_pii1.R : This file computes the BISG predictions using the combination of the Georgia voter file, geocoded unique addresses, and base wru package. The estimated time to complete using the aforementioned computer is approximately 2.5 hours. 
packages read in:  foreign (v 0.8-80) , rgdal (v 1.5-15) , sp (v 1.4-2), dplyr (v 1.0.1), wru (v 0.1-9), stringi (v 1.4.6), stringr (v 1.4.0), tidyverse (v 1.3.0), data.table (v 1.13.0)
 
ga_voterfile_wru_zip1.R : This file performs BISG at the ZIP code level, in addition to outputting the table and figures in both the manuscript and supplementary materials.  
packages read in: devtools (v 2.3.2), foreign (v 0.8-80), rgdal (v 1.5-15), sp (v 1.4-2), dplyr (v 1.0.1), wru (v 0.1-9), ggplot2 (v 3.3.2), gridExtra (v 2.3), stringi (v 1.4.6), stringr (v 1.4.0), tidyverse (v 1.3.0), ggpubr (v 0.4.0), svMisc (v 1.1.0), data.table (v 1.13.0), zipWRUext2 (v 0.0.0.9000)

NOTE: The zipWRUext2 package is currently hosted on GitHub and must be installed with devtools, using the following command:
devtools::install_github("https://github.com/jcuriel-unc/zipWRUext",subdir="zipWRUext2")
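
To see how a local R installation compares with the package versions listed above, something like the following sketch could be used. It is not part of the replication scripts; the object names (required, status) are hypothetical, and the version numbers are copied from the list for ga_voterfile_wru_zip1.R.

```r
# Package versions as listed in this read me for ga_voterfile_wru_zip1.R.
required <- c(
  devtools = "2.3.2", foreign = "0.8-80", rgdal = "1.5-15", sp = "1.4-2",
  dplyr = "1.0.1", wru = "0.1-9", ggplot2 = "3.3.2", gridExtra = "2.3",
  stringi = "1.4.6", stringr = "1.4.0", tidyverse = "1.3.0",
  ggpubr = "0.4.0", svMisc = "1.1.0", data.table = "1.13.0",
  zipWRUext2 = "0.0.0.9000"
)

# Look up the installed version of each package without loading it;
# NA marks a package that is not installed.
have <- vapply(names(required), function(pkg) {
  if (nzchar(system.file(package = pkg))) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}, character(1))

status <- data.frame(
  package = names(required),
  wanted = unname(required),
  installed = unname(have),
  stringsAsFactors = FALSE
)

# TRUE only where the installed version equals the listed one.
status$match <- unname(mapply(
  function(inst, want) !is.na(inst) && package_version(inst) == package_version(want),
  status$installed, status$wanted
))

print(status)
```

A FALSE in the match column does not necessarily mean the scripts will fail; it only flags a difference from the environment described here.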


Data Files:

ga_geocoded_all.csv: The csv of all the geocoded addresses from the ESRI 2013 street geocoder. Part of the pii data and not included. Includes the following fields:
	ObjectID - The unique polygon ID 
	Loc_name - The Geolocator used to produce the coordinates
	Score - Confidence in the coordinates attained via fuzzy matching from the geocoder, where 0 = not at all confident and 100 = certainty
	StName - The streetname of the address
	X - The longitude
	Y - The latitude
	full_addr - The full address, used to merge onto the voterfile
ga_voterfile.csv: The csv of the Georgia voterfile. Part of the pii data and not included. Includes the following fields of interest (i.e., those that are kept):
	full_addr - The full address, used to merge onto the unique address coordinates
	county_code - The internal county code for a county, established by the state of Georgia
	last_name - The surname of the voter, to be changed to "surname"
	race - The code reflecting the self-reported race of the individual
	race_desc - The spelled out self-reported racial description of the voter
	gender - The self-reported gender of the voter
	residence_zipcode - The residential ZIP code of the voter. 
	
tl_2010_13_tabblock10 (.shp, .dbf, .prj, .shx): The Census block shapefile from the U.S. Census TIGER Shapefiles. Part of the pii data and not included. Includes the following fields of interest: 
	BLOCKCE10 - The block FIPS extension of state, county, tract, and block IDs
	TRACTCE10 - The tract FIPS extension of state, county, tract, and block IDs
	COUNTYFP10 - The county FIPS extension of state, county, tract, and block IDs
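
As an illustration of how these components nest, the sketch below pastes them into a 15-digit Census block identifier (2-digit state, 3-digit county, 6-digit tract, 4-digit block). The example values and the data frame name blocks are made up; the state code 13 for Georgia matches the shapefile name.

```r
# Toy attribute rows; the field names follow the shapefile, the values are invented.
blocks <- data.frame(
  COUNTYFP10 = c("121", "089"),
  TRACTCE10 = c("001100", "020100"),
  BLOCKCE10 = c("1000", "2003"),
  stringsAsFactors = FALSE
)

state_fips <- "13"  # Georgia

# Full block identifier: state (2) + county (3) + tract (6) + block (4).
blocks$geoid <- paste0(state_fips, blocks$COUNTYFP10,
                       blocks$TRACTCE10, blocks$BLOCKCE10)

print(blocks$geoid)
```

The codes should be kept as character strings so that leading zeros (as in county 089) are not lost.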

sampled_df_geocoded01222021B.csv: The saved, pii-removed data frame of the sampled addresses from the GA voterfile for the purposes of Table 3 in the appendix. Includes the following fields:
	ObjectID: The point object unique ID for each observation. 
	Loc_name: The ESRI locator used to geocode the address. 
	Status: Status of the geocoder output: T or M 
	Score: Accuracy score for the geocoded address on a 0 - 100 scale, with 100 reflecting greater confidence. Used to determine which is the best geocoder for a given address. 
	StName: Street name for the address geocoded. 
	X: X coordinates 
	Y: Y coordinates 
	county_cod: Georgia's internal county fips code. 
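
The Score field is described above as the basis for choosing the best geocoder for an address. A minimal sketch of one way to do that is given below, assuming a hypothetical address_id column that links the rows produced by different locators for the same address (the anonymized files do not include such a key). It is not necessarily how the pii script makes the selection.

```r
# Toy geocoder output: two locators tried for each of two addresses.
geo <- data.frame(
  address_id = c(1, 1, 2, 2),  # hypothetical linking key
  Loc_name = c("Street_Addresses_US", "Postal_US",
               "Street_Addresses_US", "Postal_US"),
  Score = c(98.5, 80, 0, 100),
  stringsAsFactors = FALSE
)

# Sort so that the highest Score comes first within each address,
# then keep the first row per address.
geo <- geo[order(geo$address_id, -geo$Score), ]
best <- geo[!duplicated(geo$address_id), ]

print(best)
```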

ga_geocoded_anon.csv: The ga_geocoded_all.csv without the associated addresses. Used to create Table 2 in the appendix. Includes the following fields:
	ObjectID - The point object unique ID for each observation. 
	Loc_name - The ESRI locator used to geocode the address. 
	Score - Accuracy score for the geocoded address on a 0 - 100 scale, with 100 reflecting greater confidence. Used to determine which is the best geocoder for a given address. 
	StName - Street name for the address geocoded.             
	X - X coordinates 
	Y - Y coordinates 

ga_geocoded_blk.rds: The saved rds file of the dbf section of the Georgia block data, provided so that the user does not have to re-project and re-read the data in the event of a crash. Part of the pii data and not included.

ga_voterfile_geocoded.rds: A saved state of the Georgia voter file with wru predicted at the block level, as conducted on lines 60 - 64 of the pii script.

time_list.rds: A saved list object of how long it took to run the census block section of the script. 

seeds_vec.rds: A vector of numeric values to set the seed for the resampling. 
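
The role of a seed vector in making the resampling repeatable can be sketched as follows. The seed values, the number of rows, and the helper draw_rows are all hypothetical; the draw size of 1000 follows the description of the resampled outputs in this read me.

```r
# Hypothetical stand-ins for seeds_vec.rds and the size of the sampling frame.
seeds_vec <- c(101, 202, 303)
n_voters <- 50000

# Draw `size` row indices without replacement after fixing the seed.
draw_rows <- function(seed, n_rows, size = 1000) {
  set.seed(seed)
  sample.int(n_rows, size)
}

# One draw of 1000 rows per seed.
draws <- lapply(seeds_vec, draw_rows, n_rows = n_voters)

# Re-running with the same seed reproduces the same draw.
same_again <- identical(draw_rows(seeds_vec[1], n_voters), draws[[1]])
print(same_again)
```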

ga_voterfile_geocoded_postpii.rds: The outputted GA voterfile with predictions of race at the block, tract, county, and surname level. All pii are removed. This is the first file read into the script ga_voterfile_wru_zip1.R. The fields are as follows:
	county_code - The internal county code for a county, established by the state of Georgia
	surname - The last name of the voter
	race - The code reflecting the self-reported race of the individual
	race_desc - The spelled out self-reported racial description of the voter
	gender - The self-reported gender of the voter
	residence_zipcode - The residential ZIP code of the voter.
	pred.whi_ - Predictions on a 0 - 1 scale of the probability that a voter is white. Predicted at the levels of block, tract, county, and surname alone.
	pred.bla_ - Predictions on a 0 - 1 scale of the probability that a voter is black. Predicted at the levels of block, tract, county, and surname alone.
	pred.his_ - Predictions on a 0 - 1 scale of the probability that a voter is hispanic. Predicted at the levels of block, tract, county, and surname alone.
	pred.asi_ - Predictions on a 0 - 1 scale of the probability that a voter is asian. Predicted at the levels of block, tract, county, and surname alone.
	pred.oth_ - Predictions on a 0 - 1 scale of the probability that a voter is of some other race that is not one of the above. Predicted at the levels of block, tract, county, and surname alone.

ga_voterfile_geocoded_postpii_final1.rds: The non-pii Georgia voterfile with the ZIP code level predictions for 2010 and 2018. The additional fields are as follows:
	pred.____zip2018 - Predictions for the races white (whi), black (bla), hispanic (his), asian (asi), and other (oth) using the 2018 ACS racial demographic data.
	pred.____zip2010 - Predictions for the races white (whi), black (bla), hispanic (his), asian (asi), and other (oth) using the 2010 decennial Census racial demographic data.

ga_voterfile_geocoded2sample_set.rds: The non-pii Georgia voterfile, excluding entries where the geographic information for ZIP code or block is missing. Used as the data set to sample from for the rest of the script.

Locators: To be input into the "Input Address Locator Field" as part of the ESRI Geocoding Address toolbox

Street_Addresses_US (.loc, .lox, .loc.xml): The ESRI US street address locator necessary for geocoding.

Postal_US (.loc, .lox, .loc.xml): The ESRI US postal address locator necessary for geocoding.

MANUSCRIPT OUTPUTS 

Figure 1 = 
races_density_plots_countA.png - The ggplot exported png of the absolute count difference of the block, county and ZCTA 2018 estimates from the empirical results for each of the five wru racial groupings, except other.


SUPPLEMENTARY MATERIALS OUTPUTS 

Table 2 = 
appendix_table2.csv - The collapsed results of the sampled proportions of addresses geocoded by the ESRI street and postal geocoders. 

Table 3 = 
geocoded_output_col2.csv - The collapsed results of the sampled proportions of addresses geocoded by the ESRI classic suite of geocoders. 

Table 4 = 
race_error_table95ci.csv - The table of the percent differences at the 95 percent confidence interval between the empirical and BISG estimates for the five races at each level of geography.
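
One way such an interval table could be built from resampled differences is sketched below. It assumes the interval is the 2.5th to 97.5th percentile of the differences across draws, which this read me does not state, and it uses simulated numbers rather than the replication data. The column names follow resampled_race_diffs_long.rds; the object names toy_long and ci are hypothetical.

```r
set.seed(42)

# Simulated long-format differences: 100 draws for each race and method.
toy_long <- data.frame(
  race = rep(c("whi", "bla"), each = 200),
  method = rep(rep(c("county_diff", "blocks_diff"), each = 100), times = 2),
  abs_diff = rnorm(400),
  stringsAsFactors = FALSE
)

# Percentile bounds by race and method (assumed definition of the interval).
ci <- aggregate(abs_diff ~ race + method, data = toy_long,
                FUN = function(x) quantile(x, c(0.025, 0.975)))

# aggregate() returns the two bounds as a matrix column; flatten and rename.
ci <- do.call(data.frame, ci)
names(ci) <- c("race", "method", "lower", "upper")

print(ci)
```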

Figure 3 = 
races_density_plots.png - The ggplot exported png of the percent difference of the block, county and ZCTA 2018 estimates from the empirical results for each of the five wru racial groupings.


ALL OTHER OUTPUTS 

resampled_race_diffs_wide.rds: The resampled dataset of the respective numbers of a given race given a draw of 1000, and the differences between the empirical and predicted results. The fields are as follows: 
	empirical: The actual sum of a reported race from the Georgia voterfile given a draw of 1000. 
	surname: The wru surname predicted sum of a race given a draw of 1000.
	county: The wru county predicted sum of a race given a draw of 1000.
	zip2010: The zipWRUext2 ZIP Code 2010 Census predicted sum of a race given a draw of 1000.
	zip2018: The zipWRUext2 ZIP Code 2018 ACS predicted sum of a race given a draw of 1000.
	tract: The wru tract predicted sum of a race given a draw of 1000.
	blocks: The wru blocks predicted sum of a race given a draw of 1000.
	race: The race resampled/predicted for a given row.
	surname_diff: The percentage point difference between the empirical and surname estimates. 
	county_diff: The percentage point difference between the empirical and county estimates. 
	zip2010_diff: The percentage point difference between the empirical and zip2010 estimates. 
	zip2018_diff: The percentage point difference between the empirical and zip2018 estimates. 
	tract_diff: The percentage point difference between the empirical and tract estimates. 
	blocks_diff: The percentage point difference between the empirical and blocks estimates.
	surname_diff_count: The absolute difference between the empirical and surname estimates. 
	county_diff_count: The absolute difference between the empirical and county estimates.
	zip2010_diff_count: The absolute difference between the empirical and zip2010 estimates.
	zip2018_diff_count: The absolute difference between the empirical and zip2018 estimates.
	tract_diff_count: The absolute difference between the empirical and tract estimates.
	blocks_diff_count: The absolute difference between the empirical and blocks estimates.
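
To make the relationship between the fields above concrete, the sketch below computes one row's worth of values for a single race and a single method on simulated data. The sign convention (predicted minus empirical) and the use of the draw size as the denominator for the percentage point difference are assumptions, not confirmed by this read me; the column pred.bla_county and its values are hypothetical.

```r
set.seed(1)

# Simulated voter file with a self-reported race and a made-up
# county-level predicted probability of being black.
n <- 5000
toy <- data.frame(
  race_desc = sample(c("white", "black"), n, replace = TRUE, prob = c(0.6, 0.4)),
  stringsAsFactors = FALSE
)
toy$pred.bla_county <- ifelse(toy$race_desc == "black", 0.7, 0.2)

# One draw of 1000 voters.
draw <- toy[sample(nrow(toy), 1000), ]

# Empirical count versus the sum of predicted probabilities.
empirical <- sum(draw$race_desc == "black")
county <- sum(draw$pred.bla_county)

# Count difference and percentage point difference (assumed definitions).
county_diff_count <- county - empirical
county_diff <- 100 * county_diff_count / nrow(draw)

print(c(empirical = empirical, county = county,
        county_diff_count = county_diff_count, county_diff = county_diff))
```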

resampled_race_diffs_long.rds: The long reshaped version of the percentage point differences by method alone. The fields are as follows:
	race: The race grouping
	method: The method to predict the difference from the empirical results. Includes "surname_diff", "county_diff", "zip2010_diff", "zip2018_diff", "tract_diff", and "blocks_diff"
	abs_diff: The percentage point difference from the empirical results. 
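
The wide-to-long reshape described here can be sketched in base R as follows. The values are invented and only two of the six difference columns are shown; whether the replication script takes an absolute value before storing abs_diff is not stated in this read me, so the sketch keeps the values as they are.

```r
# Toy wide-format rows, one per race, with two of the *_diff columns.
wide <- data.frame(
  race = c("whi", "bla"),
  surname_diff = c(1.2, -0.8),
  county_diff = c(0.5, -0.3),
  stringsAsFactors = FALSE
)

diff_cols <- c("surname_diff", "county_diff")

# Stack the difference columns: one row per race and method.
long <- data.frame(
  race = rep(wide$race, times = length(diff_cols)),
  method = rep(diff_cols, each = nrow(wide)),
  abs_diff = unlist(wide[diff_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```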



races_density_plots_count.png: The ggplot exported png of the absolute count difference of the block, county and ZCTA 2018 estimates from the empirical results for each of the five wru racial groupings.


race_error_table_count95ci.rds: The table of the absolute count differences at the 95 percent confidence interval between the empirical and BISG estimates for the five races at each level of geography.
  

OTHER FILES

ga_voterfile_basic_geocoding_results01162021.png: A screenshot of the time necessary to geocode the Georgia voterfile unique addresses with the ESRI 2013 street address and postal geocoder. Presented in Figure 1 of the appendix. 


sample_geocoding_time.png: The screenshot of the ArcGIS output of the time taken to process the address sample with the ESRI classic suite of geocoders. Presented in Figure 2 of the supplementary materials. 

 

 